Abstract

We design, implement, and evaluate GPU-based algorithms for the
maximum cardinality matching problem in bipartite graphs. Such
algorithms have a variety of applications in computer science,
scientific computing, bioinformatics, and other areas. To the best of
our knowledge, ours is the first study that focuses on GPU
implementations of maximum cardinality matching algorithms. We compare
the proposed algorithms with serial and multicore implementations from
the literature on a large set of real-life problems; in the majority of
the cases, one of our GPU-accelerated algorithms is faster than both
the sequential and the multicore implementations.

1 Introduction



Bipartite graph matching is one of the fundamental problems in graph
theory and combinatorial optimization. The problem asks for a maximum
set of vertex-disjoint edges in a given bipartite graph. It has many
applications in a variety of fields such as image processing [17],
chemical structure analysis [15], and bioinformatics [2] (see also
another two applications discussed by Burkard et al. [4, Section 3.8]).
Our motivating application lies in solving sparse linear systems of
equations, as algorithms for computing a maximum cardinality bipartite
matching are run routinely in the related solvers. In this setting,
bipartite matching algorithms are used to see if the associated
coefficient matrix is reducible; if so, substantial savings in
computational requirements can be achieved [7, Chapter 6].

Achieving good parallel performance on graph algorithms is challenging,
because they are memory bound and their memory accesses have poor
locality. Moreover, because of the irregularity of the computations, it
is difficult to exploit concurrency. Algorithms for the matching problem
are no exception. There have been recent studies that aim to improve the
performance of matching algorithms on multicore and manycore
architectures. For example, Vasconcelos and Rosenhahn [18] propose a GPU
implementation of an algorithm for the maximum weighted matching problem
on bipartite graphs. Fagginger Auer and Bisseling [10] study an
implementation of a greedy graph matching on GPU. Çatalyürek et al. [5]
propose different greedy graph matching algorithms for multicore
architectures. Azad et al. [1] introduce several multicore
implementations of maximum cardinality matching algorithms on bipartite
graphs.

We propose GPU implementations of two maximum cardinality matching
algorithms. We analyze their performance and employ further
improvements. We evaluate their performance with a set of experiments on
many bipartite graphs from different applications. The experimental
results show that one of the proposed GPU-based implementations is
faster than its existing multicore counterparts.

The rest of this paper is organized as follows. The background material,
some related work, and a summary of contributions are presented in
Section 2. Section 3 describes the proposed GPU algorithms. The
comparison of the proposed GPU-based implementations with the existing
sequential and multicore implementations is given in Section 4.
Section 5 concludes the paper.

2 Background and contributions 

A bipartite graph $G = (V_1 \cup V_2, E)$ consists of a set of vertices
$V_1 \cup V_2$ where $V_1 \cap V_2 = \emptyset$, and a set of edges $E$
such that for each edge, one of the endpoints is in $V_1$ and the other
is in $V_2$. Since our motivation lies in the sparse matrix domain, we
will refer to the vertices in the two classes as row and column
vertices.

A matching $\mathcal{M}$ in a graph $G$ is a subset of the edges $E$
such that each vertex in $V_1 \cup V_2$ is in at most one edge of
$\mathcal{M}$. Given a matching $\mathcal{M}$, a vertex $v$ is said to
be matched by $\mathcal{M}$ if $v$ is in an edge of $\mathcal{M}$;
otherwise $v$ is called unmatched. The cardinality of a matching
$\mathcal{M}$, denoted by $|\mathcal{M}|$, is the number of edges in
$\mathcal{M}$. A matching $\mathcal{M}$ is called maximum if no other
matching $\mathcal{M}'$ with $|\mathcal{M}'| > |\mathcal{M}|$ exists.
For a matching $\mathcal{M}$, a path $\mathcal{P}$ in $G$ is called
$\mathcal{M}$-alternating if its edges are alternately in $\mathcal{M}$
and not in $\mathcal{M}$. An $\mathcal{M}$-alternating path
$\mathcal{P}$ is called $\mathcal{M}$-augmenting if the start and end
vertices of $\mathcal{P}$ are both unmatched.

There are three main classes of algorithms for finding maximum
cardinality matchings in bipartite graphs. The first class of algorithms
is based on augmenting paths (see a detailed summary by Duff et
al. [8]). Push-relabel-based algorithms form a second class [12]. A
third class, pseudoflow algorithms, is based on a more recent work [13].
There are $O(\sqrt{n}\,\tau)$ algorithms in the first two classes (e.g.,
the Hopcroft-Karp algorithm [14] and a variant of the push-relabel
algorithm [11]), where $n$ is the number of vertices and $\tau$ is the
number of edges in the given bipartite graph. This is the asymptotically
best bound for practical algorithms. Most of the other known algorithms
in the first two classes and in the third class have a running time
complexity of $O(n\tau)$. These three classes of algorithms are
described and compared in a recent study [16]. It has been demonstrated
experimentally that the champions of the first two families are
comparable in performance and better than those of the third family.
Since we investigate GPU acceleration of augmenting-path-based
algorithms, a brief description of them is given below (the reader is
referred to two recent papers [8, 16] and the original resources cited
in those papers for other algorithms).

Algorithms based on augmenting paths follow a common pattern. Given an
initial matching $\mathcal{M}$ (possibly empty), they search for an
$\mathcal{M}$-augmenting path $\mathcal{P}$. If none exists, then the
matching $\mathcal{M}$ is maximum by a theorem of Berge [3]. Otherwise,
the augmenting path $\mathcal{P}$ is used to increase the cardinality of
$\mathcal{M}$ by setting $\mathcal{M} = \mathcal{M} \oplus
E(\mathcal{P})$, where $E(\mathcal{P})$ is the edge set of the path
$\mathcal{P}$, and $\mathcal{M} \oplus E(\mathcal{P}) = (\mathcal{M}
\cup E(\mathcal{P})) \setminus (\mathcal{M} \cap E(\mathcal{P}))$ is the
symmetric difference. This inverts the membership in $\mathcal{M}$ for
all edges of $\mathcal{P}$. Since both the first and the last edge of
$\mathcal{P}$ were unmatched in $\mathcal{M}$, we have $|\mathcal{M}
\oplus E(\mathcal{P})| = |\mathcal{M}| + 1$. The augmenting-path-based
algorithms differ in the way these augmenting paths are found and the
associated augmentations are realized. They mainly use breadth-first
search (BFS), depth-first search (DFS), or a combination of these two
techniques to locate the augmenting paths and perform the augmentations.
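
The augmentation step can be illustrated with a small sketch. It assumes
that edges are stored as (row, column) tuples and uses a hypothetical
helper name, `augment`; it is meant only to make the symmetric
difference concrete, not to reflect any implementation discussed in this
paper.

```python
def augment(matching, path_edges):
    """Return the symmetric difference of a matching and an augmenting path.

    Both arguments are collections of (row, column) tuples (an assumed
    encoding). Edges of the path that were matched become unmatched and
    vice versa.
    """
    return set(matching) ^ set(path_edges)


# Toy instance: r1 is matched to c2; the augmenting path is c1 - r1 - c2 - r2.
matching = {("r1", "c2")}
path = [("r1", "c1"), ("r1", "c2"), ("r2", "c2")]
new_matching = augment(matching, path)
print(sorted(new_matching))
```

The path has one more unmatched edge than matched edges, so the result
contains exactly one more edge than the input matching.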

Multicore counterparts of a number of augmenting-path-based algorithms
are proposed in a recent work [1]. The parallelization of these
algorithms is achieved by using atomic operations at the BFS and/or DFS
steps of the algorithm. Although using atomic operations might not harm
the performance on a multicore machine, they should be avoided in a GPU
implementation because of the very large number of concurrent thread
executions.

As a reasonably efficient DFS is not feasible with GPUs, we accelerate
two BFS-based algorithms, called HK [14] and HKDW [9]. HK has the best
known worst-case running time complexity of $O(\sqrt{n}\,\tau)$ for a
bipartite graph with $n$ vertices and $\tau$ edges. HKDW is a variant of
HK and incorporates techniques to improve the practical running time
while having the same worst-case time complexity. Both of these
algorithms use BFS to locate the shortest augmenting paths from
unmatched columns, and then use DFS-based searches restricted to a
certain part of the input graph to augment along a maximal set of
disjoint augmenting paths. HKDW performs another set of DFS-based
searches to augment using the remaining unmatched rows. The DFS-based
searches are a major obstacle to achieving efficiency on a GPU. In order
to overcome this hurdle, we propose a scheme which alternates the edges
of a number of augmenting paths with a parallel scheme that resembles a
breadth expansion in BFS. The proposed scheme offers a high degree of
parallelism but does not guarantee a maximal set of augmentations,
potentially increasing the worst-case time complexity to $O(n\tau)$ on a
sequential machine. In other words, we trade the theoretical worst-case
time complexity for a higher degree of parallelism to achieve a better
practical running time with a GPU.

3 Methods 

We propose two algorithms for the GPU implementation of maximum
cardinality matching. These algorithms use BFS to find augmenting paths,
speculatively perform some of the augmentations, and fix any
inconsistencies that may result from the speculative augmentations.

The overall structure of the first GPU-based algorithm, APsB, is given
in Algorithm 1. It largely follows the common structure of most of the
existing sequential algorithms, and corresponds to HK. It performs a
combined BFS starting from all unmatched columns to find unmatched rows,
thus locating augmenting paths. Some of those augmentations are then
realized using a function called Alternate (described later). The
parallelism is exploited inside the InitBfsArray, BFS, Alternate, and
FixMatching functions. Algorithm 1 is given the adjacency list of the
bipartite graph together with its number of rows and columns. Any prior
matching is given in the rmatch and cmatch arrays as follows:
rmatch[r] = c and cmatch[c] = r if the row r is matched to the column c;
rmatch[r] = -1 if r is unmatched; cmatch[c] = -1 if c is unmatched.
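
As a concrete reading of these inputs, the sketch below builds the
adjacency arrays for a tiny graph and checks that a pair of rmatch and
cmatch arrays describes the same matching. It assumes that cxadj stores
per-column offsets into cadj (a compressed column layout, as the loops
in the kernels suggest); the helper names `build_csr` and
`is_consistent` are hypothetical.

```python
def build_csr(nc, edges):
    """Build (cxadj, cadj) from (column, row) pairs.

    cxadj[c]..cxadj[c + 1] delimits the rows adjacent to column c in cadj
    (assumed layout).
    """
    cxadj = [0] * (nc + 1)
    for c, _ in edges:
        cxadj[c + 1] += 1
    for c in range(nc):
        cxadj[c + 1] += cxadj[c]
    cadj = [0] * len(edges)
    next_slot = cxadj[:-1].copy()  # next free position for each column
    for c, r in edges:
        cadj[next_slot[c]] = r
        next_slot[c] += 1
    return cxadj, cadj


def is_consistent(rmatch, cmatch):
    """Check that rmatch and cmatch encode the same matching (-1 = unmatched)."""
    rows_ok = all(c == -1 or cmatch[c] == r for r, c in enumerate(rmatch))
    cols_ok = all(r == -1 or rmatch[r] == c for c, r in enumerate(cmatch))
    return rows_ok and cols_ok


# Two columns, three rows; column 1 is adjacent to all rows, column 0 to row 0.
edges = [(0, 0), (1, 0), (1, 1), (1, 2)]
cxadj, cadj = build_csr(2, edges)
rmatch, cmatch = [1, -1, -1], [-1, 0]  # row 0 is matched to column 1
print(cxadj, cadj, is_consistent(rmatch, cmatch))
```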

Algorithm 1: Shortest augmenting paths (APsB)
Data: cxadj, cadj, nc, nr, rmatch, cmatch

 1  augmenting_path_found ← true;
 2  while augmenting_path_found do
 3      bfs_level ← L0;
 4      InitBfsArray(bfs_array, cmatch, L0);
 5      vertex_inserted ← true;
 6      while vertex_inserted do
 7          predecessor ← Bfs(bfs_level, bfs_array, cxadj, cadj, nc, rmatch,
 8                            vertex_inserted, augmenting_path_found);
 9          if augmenting_path_found then
10              break;
11          bfs_level ← bfs_level + 1;
12      (cmatch, rmatch) ← Alternate(cmatch, rmatch, nc, predecessor);
13      (cmatch, rmatch) ← FixMatching(cmatch, rmatch);

The outer loop of Algorithm 1 iterates until no more augmenting paths
are found, thereby guaranteeing a maximum matching. The inner loop is
responsible for completing the breadth-first search for augmenting
paths. A single iteration of this loop corresponds to one level of BFS.
The inner loop iterates until all shortest augmenting paths are found.
Then, the edges in these shortest augmenting paths are alternated inside
the Alternate function. Unlike the sequential HK algorithm, APsB does
not find a maximal set of augmenting paths.

By removing lines 9 and 10 of Algorithm 1, another matching algorithm is
obtained. This method continues with the BFSs until all possible
unmatched rows are found; it can therefore be considered as the GPU
implementation of the HKDW algorithm. This variant is called APFB.



Algorithm 2: BFS Kernel Function-1 (GPUBFS)
Data: bfs_level, bfs_array, cxadj, cadj, nc, rmatch,
      vertex_inserted, augmenting_path_found

 1  process_cnt ← getProcessCount(nc);
 2  for i from 0 to process_cnt - 1 do
 3      col_vertex ← i × tot_thread_num + tid;
 4      if bfs_array[col_vertex] = bfs_level then
 5          for j from cxadj[col_vertex] to cxadj[col_vertex + 1] do
 6              neighbor_row ← cadj[j];
 7              col_match ← rmatch[neighbor_row];
 8              if col_match > -1 then
 9                  if bfs_array[col_match] = L0 - 1 then
10                      vertex_inserted ← true;
11                      bfs_array[col_match] ← bfs_level + 1;
12                      predecessor[neighbor_row] ← col_vertex;
13              else
14                  if col_match = -1 then
15                      rmatch[neighbor_row] ← -2;
16                      predecessor[neighbor_row] ← col_vertex;
17                      augmenting_path_found ← true;

We propose two implementations of the BFS kernel. Algorithm 2 is the
first one. The BFS kernel is responsible for a single-level BFS
expansion. That is, it takes the set of vertices at a BFS level and adds
the union of the unvisited neighbors of those vertices as the next level
of vertices. Initially, the input bfs_array is filled by a simple
InitBfsArray kernel with bfs_array[c] = $L_0 - 1$ if cmatch[c] > -1 and
bfs_array[c] = $L_0$ if cmatch[c] = -1 ($L_0$ denotes the BFS start
level).

The GPU threads partition the column vertices in a single dimension.
Each thread with id tid is assigned a number of columns, which is
obtained via the following function:

$$
\mathrm{getProcessCount}(nc) =
\begin{cases}
\left\lceil \dfrac{nc}{tot\_thread\_num} \right\rceil & \text{if } tid < nc \bmod tot\_thread\_num, \\[2ex]
\left\lfloor \dfrac{nc}{tot\_thread\_num} \right\rfloor & \text{otherwise.}
\end{cases}
$$

Once the number of columns is obtained, the threads traverse their first
assigned column vertex. The indices of the columns assigned to a thread
differ by tot_thread_num to allow coalesced global memory accesses. A
thread traverses the neighboring row vertices of the current column if
the column's BFS level is equal to the current bfs_level. If a thread
encounters a matched row during the traversal, its matching column is
retrieved. If that column has not been traversed yet, its BFS level is
marked in bfs_array. On the other hand, when a thread encounters an
unmatched row, an augmenting path is found. In this case, the match of
the neighbor row is set to -2, and this information is used by Alternate
later.
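
The column-to-thread assignment can be checked with a short sequential
sketch. The function names below are hypothetical, and the thread id is
passed explicitly since there is no GPU runtime here; the sketch only
verifies that the strided assignment covers every column exactly once.

```python
def get_process_count(nc, tid, tot_thread_num):
    """Number of columns handled by thread tid: ceil(nc / T) for the first
    (nc mod T) threads, floor(nc / T) for the rest."""
    quotient, remainder = divmod(nc, tot_thread_num)
    return quotient + 1 if tid < remainder else quotient


def columns_of(tid, nc, tot_thread_num):
    """Column indices visited by thread tid; consecutive indices differ by
    tot_thread_num, as in line 3 of the BFS kernel."""
    count = get_process_count(nc, tid, tot_thread_num)
    return [i * tot_thread_num + tid for i in range(count)]


nc, tot_thread_num = 10, 4
assignment = [columns_of(tid, nc, tot_thread_num) for tid in range(tot_thread_num)]
print(assignment)
```

In each step i, threads with consecutive ids touch consecutive column
indices, which is the access pattern the text associates with coalesced
global memory accesses.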



Algorithm 3: Alternate
Data: cmatch, rmatch, nc, nr, predecessor

 1  process_vcnt ← getProcessCount(nr);
 2  for i from 0 to process_vcnt - 1 do
 3      row_vertex ← i × tot_thread_num + tid;
 4      if rmatch[row_vertex] = -2 then
 5          while row_vertex ≠ -1 do
 6              matched_col ← predecessor[row_vertex];
 7              matched_row ← cmatch[matched_col];
 8              if predecessor[matched_row] = matched_col then
 9                  break;
10              cmatch[matched_col] ← row_vertex;
11              rmatch[row_vertex] ← matched_col;
12              row_vertex ← matched_row;




Figure 1: Vertices $r_1$ and $c_2$ are matched; others are not. Two
augmenting paths starting from $c_1$ are possible.

Algorithm 3 gives the description of the Alternate function. This kernel
alternates the matching edges with the non-matching edges of the
augmenting paths found; some of those paths end up being fully
augmented and some are only partially alternated. Here, each thread is
assigned a number of rows. Since rmatch of an unmatched row (that is
also an endpoint of an augmenting path) has been set to -2 in the BFS
kernel, only the threads whose row vertex's match is -2 start Alternate.
Since there might be several augmenting paths for an unmatched column,
race conditions while writing to the cmatch and rmatch arrays are
possible. Such a race condition might cause infinite loops (in the inner
while loop) or inconsistencies if care is not taken. We prevent these by
checking the predecessor of a matched row (line 8). For example, in
Fig. 1, two different augmenting paths that end with $r_2$ and $r_3$ are
found for $c_1$. If the thread of $r_2$ starts before the thread of
$r_3$ in Alternate, the match of $c_2$ will be updated to $r_2$
(line 10). Then, the thread of $r_3$ will read matched_row of $c_2$ as
$r_2$ (line 7). This would cause an infinite loop without the check at
line 8. Inconsistencies may occur when the threads of $r_2$ and $r_3$
are in the same warp. In this case, the if-check will not hold for
either thread, and both will write their row vertices to cmatch
(line 10). Since only one thread will be successful at writing, this
will cause an inconsistency. Such inconsistencies are fixed by the
FixMatching kernel, which implements: rmatch[r] ← -1 for any r
satisfying cmatch[rmatch[r]] ≠ r.
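
The interplay of the BFS kernel, Alternate, and FixMatching can be
followed in a sequential emulation in which each "thread" runs to
completion before the next one starts. This is a sketch under several
assumptions rather than the GPU code: the function names are
hypothetical, the two flags are reset where the loops need them, the
explicit guards against the -1 and -2 markers are additions of the
sketch, and the edge set of Fig. 1 is inferred from the caption and the
discussion above.

```python
L0 = 2  # BFS start level (assumed value)


def bfs_step(level, bfs, cxadj, cadj, rmatch, pred):
    """Sequential stand-in for one BFS kernel launch (one BFS level).
    Returns (vertex_inserted, augmenting_path_found) for this level."""
    inserted = found = False
    for col in range(len(cxadj) - 1):  # each column plays the role of a thread
        if bfs[col] != level:
            continue
        for j in range(cxadj[col], cxadj[col + 1]):
            row = cadj[j]
            col_match = rmatch[row]
            if col_match > -1:
                if bfs[col_match] == L0 - 1:  # matched column not visited yet
                    inserted = True
                    bfs[col_match] = level + 1
                    pred[row] = col
            elif col_match == -1:  # unmatched row: end of an augmenting path
                rmatch[row] = -2
                pred[row] = col
                found = True
    return inserted, found


def alternate(cmatch, rmatch, pred):
    """From every row marked -2, flip edges along the predecessor links."""
    for start in range(len(rmatch)):
        if rmatch[start] != -2:
            continue
        row = start
        while row != -1:
            col = pred[row]
            next_row = cmatch[col]
            # Check of line 8; the -1 guard is an addition of this sketch.
            if next_row != -1 and pred[next_row] == col:
                break
            cmatch[col] = row
            rmatch[row] = col
            row = next_row


def fix_matching(cmatch, rmatch):
    """Unmatch rows whose column does not point back (also clears -2 marks)."""
    for row, col in enumerate(rmatch):
        if col < 0 or cmatch[col] != row:
            rmatch[row] = -1


def max_matching(cxadj, cadj, rmatch, cmatch, full_bfs=False):
    """full_bfs=False mimics APsB; full_bfs=True drops the early break (APFB)."""
    nc = len(cxadj) - 1
    pred = [-1] * len(rmatch)
    found = True
    while found:
        bfs = [L0 if cmatch[c] == -1 else L0 - 1 for c in range(nc)]
        level, found, inserted = L0, False, True  # flag resets assumed here
        while inserted:
            inserted, hit = bfs_step(level, bfs, cxadj, cadj, rmatch, pred)
            found = found or hit
            if found and not full_bfs:
                break
            level += 1
        alternate(cmatch, rmatch, pred)
        fix_matching(cmatch, rmatch)
    return rmatch, cmatch


# Fig. 1 with c1, c2 -> columns 0, 1 and r1, r2, r3 -> rows 0, 1, 2;
# r1 is initially matched to c2.
cxadj, cadj = [0, 1, 4], [0, 0, 1, 2]
rmatch, cmatch = max_matching(cxadj, cadj, [1, -1, -1], [-1, 0])
print(rmatch, cmatch)
```

In this run the thread of $r_2$ completes its path first, the thread of
$r_3$ stops at the line 8 check, and FixMatching resets the leftover
mark of $r_3$. A sequential run cannot reproduce the same-warp write
conflict described above; it only shows the control flow.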

Algorithm 4 gives the description of a slightly different BFS kernel
function. This function takes a root array as an extra argument.
Initially, the root array is filled with root[c] = 0 if cmatch[c] > -1,
and root[c] = c if cmatch[c] = -1. This array holds the root (as the
index of the column vertex) of an augmenting path, and this information
is transferred down during BFS. Whenever an augmenting path is found,
the entry in bfs_array for the root of the augmenting path is set to
$L_0 - 2$. This information is used at the beginning of the BFS kernel:
no further BFS traversal is done if an augmenting path has already been
found for the root of the traversed column vertex. Therefore, while the
method increases the global memory accesses by introducing an extra
array, it provides an early exit mechanism for BFS.
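
The early exit can be seen in a sequential emulation of one BFS level
with the root array. As before, this is only a sketch: the function name
`bfs_step_wr` is hypothetical, "threads" run one after another, and the
small graph is made up for the illustration.

```python
L0 = 2  # BFS start level (assumed value)


def bfs_step_wr(level, bfs, root, cxadj, cadj, rmatch, pred):
    """Sequential stand-in for one launch of the BFS kernel with roots."""
    inserted = found = False
    for col in range(len(cxadj) - 1):
        if bfs[col] != level:
            continue
        my_root = root[col]
        if bfs[my_root] < L0 - 1:  # a path was already found for this root
            continue
        for j in range(cxadj[col], cxadj[col + 1]):
            row = cadj[j]
            col_match = rmatch[row]
            if col_match > -1:
                if bfs[col_match] == L0 - 1:
                    inserted = True
                    bfs[col_match] = level + 1
                    root[col_match] = my_root  # pass the root down the tree
                    pred[row] = col
            elif col_match == -1:
                bfs[my_root] = L0 - 2  # mark the root as done
                rmatch[row] = -2
                pred[row] = col
                found = True
    return inserted, found


# Column 0 is the only unmatched column; rows 0 and 1 are matched to
# columns 1 and 2; rows 2 and 3 are unmatched and adjacent to columns 1 and 2.
cxadj, cadj = [0, 2, 4, 6], [0, 1, 0, 2, 1, 3]
rmatch, cmatch = [1, 2, -1, -1], [-1, 0, 1]
bfs = [L0 if m == -1 else L0 - 1 for m in cmatch]
root = [c if m == -1 else 0 for c, m in enumerate(cmatch)]
pred = [-1] * len(rmatch)
bfs_step_wr(L0, bfs, root, cxadj, cadj, rmatch, pred)
bfs_step_wr(L0 + 1, bfs, root, cxadj, cadj, rmatch, pred)
print(rmatch, bfs)
```

Column 1 reaches the unmatched row 2 and marks the root; column 2, which
shares the same root, then skips its traversal, so row 3 is never marked
with -2.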

We further improve GPUBFS-WR by making use of the arrays root and
bfs_array. BFS kernels might find several rows to match with the same
unmatched column, and set rmatch[·] to -2 for each. This causes
Alternate to start from several rows that can be matched with the same
unmatched column. Therefore, it may perform unnecessary alternations
until these augmenting paths intersect. Conflicts may occur at these
intersection points (which are then resolved by the FixMatching
function). By choosing $L_0$ as 2, we can limit the range of the values
that bfs_array takes to positive numbers. Therefore, by setting the
bfs_array entry to -(neighbor_row) at line 19 of Algorithm 4, we can
provide more information to the Alternate function. With this, Alternate
can determine the beginning and the end of an augmenting path, and it
can alternate only along the correct augmenting paths. APsB-GPUBFS-WR
(and the Alternate function used together with it) includes these
improvements. However, they are not included in APFB-GPUBFS-WR since
they do not improve its performance.

4 Experiments 

The running times of the proposed implementations are compared against
the sequential HK and PFP implementations [8], and against the multicore

Algorithm 4: BFS Kernel Function-2 (GPUBFS-WR)
Data: bfs_level, bfs_array, cxadj, cadj, nc, rmatch, root,
      vertex_inserted, augmenting_path_found

 1  process_cnt ← getProcessCount(nc);
 2  for i from 0 to process_cnt - 1 do
 3      col_vertex ← i × tot_thread_num + tid;
 4      if bfs_array[col_vertex] = bfs_level then
 5          myRoot ← root[col_vertex];
 6          if bfs_array[myRoot] < L0 - 1 then
 7              continue;
 8          for j from cxadj[col_vertex] to cxadj[col_vertex + 1] do
 9              neighbor_row ← cadj[j];
10              col_match ← rmatch[neighbor_row];
11              if col_match > -1 then
12                  if bfs_array[col_match] = L0 - 1 then
13                      vertex_inserted ← true;
14                      bfs_array[col_match] ← bfs_level + 1;
15                      root[col_match] ← myRoot;
16                      predecessor[neighbor_row] ← col_vertex;
17              else
18                  if col_match = -1 then
19                      bfs_array[myRoot] ← L0 - 2;
20                      rmatch[neighbor_row] ← -2;
21                      predecessor[neighbor_row] ← col_vertex;
22                      augmenting_path_found ← true;

Table 1: Geometric mean of the running time of the GPU algorithms on
different sets of instances.

                            APFB                            APsB
                    GPUBFS        GPUBFS-WR         GPUBFS        GPUBFS-WR
                   MT      CT      MT      CT      MT      CT      MT      CT
O_S1             2.96    1.89    2.12    1.34    3.68    2.88    2.98    2.27
O_Hardest20      4.28    2.70    3.21    1.93    5.23    4.14    4.20    3.13
RCP_S1           3.66    3.24    1.13    1.05    3.52    3.33    2.22    2.14
RCP_Hardest20    7.27    5.79    3.37    2.85   12.06   10.75    8.17    7.41


parallel implementations P-PFP, P-DBFS, and P-HK [1]. The CPU
implementations are tested on a computer with 2.27GHz dual quad-core
Intel Xeon CPUs with 2-way hyper-threading and 48GB main memory. The
algorithms are implemented in C++ and OpenMP. The GPU implementations
are tested on an NVIDIA Tesla C2050 with 2.6GB of usable global memory.
The C2050 is equipped with 14 multiprocessors, each containing 32 CUDA
cores, totaling 448 CUDA cores. The implementations are compiled with
gcc-4.4.4, cuda-4.2.9, and the -O2 optimization flag. For the multicore
algorithms, 8 threads are used. A standard heuristic (called the cheap
matching, see [8]) is used to initialize all tested algorithms. We
compare the running time of the matching algorithms after this common
initialization.

The two main algorithms, APFB and APsB, can each use two different BFS
kernel functions (GPUBFS and GPUBFS-WR). Moreover, each of these
algorithms has two versions: (i) CT uses a constant number of threads
with fixed grid and block sizes (256 × 256) and assigns multiple
vertices to each thread; (ii) MT tries to assign one vertex to each
thread. The number of threads used in the second version is chosen as
MT = min(nc, #threads), where nc is the number of columns and #threads
is the maximum number of threads of the architecture. Therefore, we have
eight GPU-based algorithms.

The algorithms are run on bipartite graphs corresponding to 70 different
matrices from a variety of classes in the UFL matrix collection [6]. We
also permuted the matrices randomly by rows and columns and included
them as a second set (labeled RCP). These permutations usually render
the problems harder for the augmenting-path-based algorithms [8]. For
both sets, we report the performance on a smaller subset which contains
those matrices for which at least one of the sequential algorithms took
more than one second. We call these sets O_S1 (28 matrices) and RCP_S1
(50 matrices). We also have another two subsets, called O_Hardest20 and
RCP_Hardest20, that contain the 20 matrices on which the sequential
algorithms required the longest running time.

First, we compare the performance of the proposed GPU algorithms. 

(a) Hamrle3 (b) delaunay_n23

Figure 2: The BFS ids and the number of kernel executions for each BFS
in APsB and APFB variants for two graphs. The x axis shows the id of
the while iteration at line 2 of APsB. The y axis shows the number of
while iterations at line 6 of APsB.

Table 1 shows the geometric mean of the running time on different sets.
As the table shows, using a constant number of threads (CT) always
improves the performance of an algorithm, since it increases the
granularity of the work performed by each thread. GPUBFS-WR is always
faster than GPUBFS. This is because of the unnecessary BFS traversals in
the GPUBFS algorithm: GPUBFS cannot determine whether an augmenting path
has already been found for an unmatched column, so it continues to
explore. These unnecessary BFS traversals not only increase the BFS
time, but also reduce the likelihood of finding an augmenting path for
other unmatched columns. Moreover, the Alternate scheme turns out to be
more suitable for APFB than for APsB, since with APFB it can augment
along more paths (there is a larger set of possibilities). For example,
Figs. 2(a) and 2(b) show the number of BFS iterations and the number of
BFS levels in each iteration for, respectively, the Hamrle3 and
delaunay_n23 graphs. As clearly seen in both figures, the APFB variants
converge in a smaller number of iterations than the APsB variants, and
for most of the graphs, the total number of BFS kernel calls is smaller
for APFB (as in Fig. 2(a)). However, for a small subset of the graphs,
although the augmenting path exploration of APsB converges in a larger
number of iterations, the number of BFS levels in each iteration is much
smaller than for APFB (as in Fig. 2(b)). Unlike in the general case,
APsB outperforms APFB in such cases. Since APFB using GPUBFS-WR and CT
almost always obtains the best performance, we compare only this
algorithm with the other implementations in the following.
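
Table 1 aggregates running times by their geometric mean. For readers
who want to reproduce such an aggregate from per-instance timings, a
minimal sketch follows; the function name and the toy inputs are made up
for the illustration and are not measurements from this study.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Toy running times (not measurements): one slow instance dominates the
# arithmetic mean much more than the geometric mean.
toy_times = [1.0, 1.0, 100.0]
print(geometric_mean(toy_times), sum(toy_times) / len(toy_times))
```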

Figures 3(a) and 3(b) give the log-scaled speedup profiles of the best
GPU and multicore algorithms on the original and permuted graphs. The
speedups are calculated with respect to the fastest of the sequential
algorithms PFP and HK (on the original graphs HK was faster; on the
permuted ones PFP was faster). A point (x, y) in the plots means that
the probability of obtaining a speedup of at least $2^x$ is y. As the
plots show, the GPU algorithm has the best overall speedup. It is faster
than the sequential HK

(a) Original graphs (b) Permuted graphs

Figure 3: Log-scaled speedup profiles.



algorithm for 86% of the original graphs, while it is faster than PFP on 
76% of the permuted graphs. P-DBFS obtains the best performance among 
the multicore algorithms. However, its performance degrades on permuted 
graphs. Although P-PFP is more robust than P-DBFS to permutations, its 
overall performance is inferior to that of P-DBFS. P-HK is outperformed by 
the other algorithms in both sets. 
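
The definition of a speedup profile given above can be turned into a few
lines of code. The sketch below uses a hypothetical function name and
toy speedup values (not results from this study); it returns, for each
x, the fraction of instances whose speedup is at least $2^x$.

```python
def speedup_profile(speedups, xs):
    """For each x in xs, the fraction of instances with speedup >= 2**x."""
    n = len(speedups)
    return [sum(s >= 2.0 ** x for s in speedups) / n for x in xs]


# Toy speedups (not measurements): one slowdown, one tie, two speedups.
toy_speedups = [0.5, 1.0, 2.0, 8.0]
profile = speedup_profile(toy_speedups, [-1, 0, 1, 2, 3])
print(profile)
```

The value at x = 0 is the fraction of instances on which the parallel
algorithm is at least as fast as the sequential baseline.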




(a) Original graphs (b) Permuted graphs 

Figure 4: Performance profiles 



Figures 4(a) and 4(b) show the performance profiles of the GPU and
multicore algorithms. A point (x, y) in this plot means that with
probability y, the algorithm obtains a performance that is at most x
times worse than the best running time. The plots clearly show the
separation between the GPU algorithm and the multicore ones, especially
for the original graphs and, for x < 7, for the permuted ones, thus
marking the GPU algorithm as the fastest in most cases. In particular,
the GPU algorithm obtains the best performance on 61% of the original
graphs, while this ratio increases to 74% for the permuted ones.
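
A performance profile in the sense described above can be computed as
follows. This is a sketch with a hypothetical function name and toy
running times for two made-up algorithms "A" and "B"; none of the
numbers are measurements from this study.

```python
def performance_profile(times, xs):
    """times maps an algorithm name to its running times, one per instance
    (same instance order for all algorithms). For each x in xs, return the
    fraction of instances on which the algorithm is within a factor x of
    the best running time for that instance."""
    names = list(times)
    n = len(times[names[0]])
    best = [min(times[name][i] for name in names) for i in range(n)]
    return {
        name: [sum(times[name][i] <= x * best[i] for i in range(n)) / n for x in xs]
        for name in names
    }


# Toy running times (not measurements) on four instances.
toy_times = {"A": [1.0, 2.0, 6.0, 1.0], "B": [2.0, 1.0, 2.0, 4.0]}
profile = performance_profile(toy_times, [1, 2, 3, 4])
print(profile)
```

The value at x = 1 is the fraction of instances on which an algorithm
attains the best running time, which corresponds to the 61% and 74%
figures quoted above for the GPU algorithm.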

Figure 5 gives the overall speedups. The proposed GPU algorithm obtains
average speedup values of at least 3.61 and 3.54 on, respectively,
original and permuted graphs. The speedups increase for the hardest
instances, where the GPU algorithm achieves speedups of 3.96 and 9.29,
respectively, on original and permuted graphs. Table 2 gives the actual
running time for the O_Hardest20 set for the best GPU and multicore
algorithms, together with the sequential algorithms (the running time of
all mentioned algorithms

Figure 5: Overall speedup of the proposed GPU algorithm w.r.t. PFP (left
bars) and HK (right bars) algorithms.

on the complete set of graphs can be found at
http://bmi.osu.edu/hpc/software/matchmaker2/maxCardMatch.html). As seen
from this table, except for six instances among the original graphs and
another two among the permuted graphs, the GPU algorithm is faster than
the best sequential algorithm. It is also faster than the multicore ones
on all but five of the original graphs.



Table 2: The actual running time of each algorithm for the O_Hardest20
set.

                             Original graphs                      Permuted graphs
Matrix name               GPU   P-DBFS      PFP       HK      GPU   P-DBFS      PFP       HK
roadNet-CA               0.34     0.53     0.95     2.48     0.39     1.88     3.05     4.89
delaunay_n23             0.96     1.26     2.68     1.11     0.90     5.56     3.27    14.34
coPapersDBLP             0.42     6.27     3.11     1.62     0.38     1.25     0.29     1.26
kron_g500-logn21         0.99     1.50     5.37     4.73     3.71     4.01    64.29    16.08
amazon-2008              0.11     0.18     6.11     1.85     0.41     1.37    61.32     4.69
delaunay_n24             1.98     2.41     6.43     2.22     1.86    12.84     6.92    35.24
as-Skitter               0.49     1.89     7.79     3.56     3.27     5.74   472.63    29.63
amazon0505               0.18    22.70     9.05     1.87     0.24    15.23    17.59     2.23
wikipedia-20070206       1.09     5.24    11.98     6.52     1.05     5.99     9.74     5.73
Hamrle3                  1.36     2.70     0.04    12.61     3.85     7.39    37.71    57.00
hugetrace-00020          7.90   393.13    15.95    15.02     1.52     9.97     8.68    38.27
hugebubbles-00000       13.16     3.55    19.81     5.56     1.80    10.91    10.03    38.97
wb-edu                  33.82     8.61     3.38    20.35    17.43    20.10     9.49    51.14
rgg_n_2_24_s0            3.68     2.25    25.40     0.12     2.20    12.50     5.72    31.78
patents                  0.88     0.84    92.03    16.18     0.91     0.97   101.76    18.30
italy_osm                5.86     1.20     1.02   122.00     0.70     3.97     6.24    18.34
soc-LiveJournal1         3.32    14.35   243.91    21.16     3.73     7.14   343.94    20.71
ljournal-2008            2.37    10.30   360.31    17.66     6.90     7.58   176.69    23.45
europe_osm              57.53    11.21    14.15  1911.56     7.21    37.93    68.18   197.03
com-livejournal          4.58    22.46  2879.36    34.28     5.88    17.19   165.32    29.40






5 Concluding remarks 



We proposed parallel, BFS-based GPU implementations of maximum
cardinality matching algorithms for bipartite graphs. We presented
experiments on various datasets, and compared the performance of the
proposed GPU implementations against sequential and multicore
algorithms. The experiments showed that, in most cases, the proposed GPU
implementations are faster than the existing parallel multicore
implementations. The speedups achieved with respect to well-known
sequential implementations varied from 0.03 to 629.19, averaging 9.29 on
a set of the 20 hardest problems with respect to the fastest sequential
algorithm. A GPU is a device with restricted memory. Although an
out-of-core or distributed-memory algorithm is an option when the graph
does not fit into the device memory, a direct implementation of these
algorithms will surely not be efficient. We plan to investigate
techniques to obtain good matching performance for extreme-scale
bipartite graphs on GPUs.

References 

[1] Azad, A., Halappanavar, M., Rajamanickam, S., Boman, E.G., Khan, 
A., Pothen, A.: Multithreaded algorithms for maximum matching in 
bipartite graphs. In: 26th IPDPS. pp. 860-872. IEEE (2012) 

[2] Azad, A., Pothen, A.: Multithreaded algorithms for matching in graphs 
with application to data analysis in flow cytometry. In: 26th IPDPS 
Workshops & PhD Forum, pp. 2494-2497. IEEE (2012) 

[3] Berge, C.: Two theorems in graph theory. P. Natl. Acad. Sci. USA 43,
842-844 (1957)

[4] Burkard, R., Dell'Amico, M., Martello, S.: Assignment Problems. 
SIAM, Philadelphia, PA, USA (2009) 

[5] Çatalyürek, Ü.V., Deveci, M., Kaya, K., Uçar, B.: Multithreaded
clustering for multi-level hypergraph partitioning. In: 26th IPDPS. pp.
848-859. IEEE (2012)

[6] Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. 
ACM T. Math. Software 38(1), 1:1-1:25 (2011) 

[7] Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Ma- 
trices. Oxford University Press, London (1986) 

[8] Duff, I.S., Kaya, K., Uçar, B.: Design, implementation, and analysis of
maximum transversal algorithms. ACM T. Math. Software 38(2), 13
(2011)






[9] Duff, I.S., Wiberg, T.: Remarks on implementation of $O(n^{1/2}\tau)$
assignment algorithms. ACM T. Math. Software 14(3), 267-287 (1988)

[10] Fagginger Auer, B., Bisseling, R.: A GPU algorithm for greedy graph 
matching. Facing the Multicore-Challenge II pp. 108-119 (2012) 

[11] Goldberg, A., Kennedy, R.: Global price updates help. SIAM J. Dis- 
crete Math. 10(4), 551-572 (1997) 

[12] Goldberg, A.V., Tarjan, R.E.: A new approach to the maximum flow 
problem. J. Assoc. Comput. Mach. 35, 921-940 (1988) 

[13] Hochbaum, D.S.: The pseudoflow algorithm and the pseudoflow-based 
simplex for the maximum flow problem. In: 6th IPCO, pp. 325-337. 
London, UK (1998) 

[14] Hopcroft, J.E., Karp, R.M.: An $n^{5/2}$ algorithm for maximum matchings
in bipartite graphs. SIAM J. Comput. 2(4), 225-231 (1973)

[15] John, P.E., Sachs, H., Zheng, M.: Kekulé patterns and Clar patterns
in bipartite plane graphs. J. Chem. Inf. Comp. Sci. 35(6), 1019-1021
(1995)

[16] Kaya, K., Langguth, J., Manne, F., Uçar, B.: Push-relabel based
algorithms for the maximum transversal problem. Comput. Oper. Res.
40(5), 1266-1275 (2012)

[17] Kim, W.Y., Kak, A.C.: 3-D object recognition using bipartite matching
embedded in discrete relaxation. IEEE T. Pattern Anal. 13(3), 224-251
(1991)

[18] Vasconcelos, C., Rosenhahn, B.: Bipartite graph matching computation
on GPU. In: Energy Minimization Methods in Computer Vision and
Pattern Recognition, pp. 42-55. Springer (2009)






